Cluster based Mixed Coding Schemes for Inverted File Index Compression
Abstract
One way to improve inverted file compression is to exploit the cluster property [1] of document collections, which states that term occurrences are not uniformly distributed: some terms are used more frequently in some parts of the collection than in others. The corresponding parts of the inverted lists consequently contain clusters of small d-gap values. Interpolative code [9] exploits the cluster property of term occurrences and achieves very good performance. Other codes that favor small d-gaps also perform well on document collections with the cluster property.
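To make the effect of clustering concrete, the following minimal sketch compares the encoded size of two posting lists that hold the same number of document IDs over the same range, one clustered and one evenly spread. It assumes Elias gamma coding as a stand-in for "codes that favor small d-gaps"; it is not the mixed coding scheme of the paper, and the function names and sample lists are hypothetical.

```python
def d_gaps(doc_ids):
    """Convert a sorted list of document IDs into d-gaps.

    The first ID is kept as-is; every later entry is the difference
    from its predecessor.
    """
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def gamma_bits(n):
    """Length in bits of the Elias gamma code for a positive integer n."""
    # floor(log2 n) unary prefix bits plus floor(log2 n) + 1 binary bits.
    return 2 * (n.bit_length() - 1) + 1


def gamma_size(doc_ids):
    """Total bits needed to gamma-code the d-gaps of a posting list."""
    return sum(gamma_bits(gap) for gap in d_gaps(doc_ids))


# Hypothetical posting lists: same length, same first and last document ID.
clustered = [100, 101, 102, 103, 104, 105, 106, 900]
spread = [100, 214, 328, 442, 556, 670, 784, 900]

print(d_gaps(clustered))      # [100, 1, 1, 1, 1, 1, 1, 794]
print(gamma_size(clustered))  # 38
print(gamma_size(spread))     # 104
```

In this toy case the clustered list needs 38 bits against 104 for the evenly spread one, because the run of gaps equal to 1 costs a single bit each and outweighs the one large gap.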
Similar resources
A New Compression Based Index Structure for Efficient Information Retrieval
Finding desired information in a large data set is a difficult problem. Information retrieval is concerned with the structure, analysis, organization, storage, searching, and retrieval of information. The index is the main constituent of an IR system. Nowadays, the exponential growth of information makes the index structure large enough to affect the IR system’s quality. So compressing the Index struct...
Re-Ordered FEGC and Block Based FEGC for Inverted File Compression
Data compression has been widely used in many Information Retrieval based applications, such as web search engines and digital libraries, to make data retrieval faster. In these applications, universal codes (Elias codes (EC), Fibonacci code (FC), Rice code (RC), Extended Golomb code (EGC), Fast Extended Golomb code (FEGC) etc.) have been preferred over statistical codes (Huf...
Re-Pair Compression of Inverted Lists
Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: We use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompressi...
On the Impact of Random Index-Partitioning on Index Compression
The performance of processing search queries depends heavily on the stored index size. Accordingly, considerable research efforts have been devoted to the development of efficient compression techniques for inverted indexes. Roughly, index compression relies on two factors: the ordering of the indexed documents, which strives to position similar documents in proximity, and the encoding of the i...
Optimize Document Identifier Assignment for Inverted Index Compression
Document identifier assignment is a technique for inverted file index compression that reduces the d-gap values of posting lists. Existing studies have approached it with either TSP or clustering methods. However, there is no proper formulation of this problem, and the existing approaches have no theoretical guarantee of being good approximations. In this paper, we first formulate document identifier assignmen...
Journal: JDIM
Volume: 6
Publication date: 2008